15 research outputs found

    Autoencoders for natural language semantics

    Autoencoders are artificial neural networks that learn representations. In an autoencoder, the encoder transforms an input into a representation, and the decoder tries to recover the input from the representation. This thesis compiles three applications of these models to natural language processing: learning word and sentence representations, and better understanding compositionality. In the first paper, we show that we can autoencode dictionary definitions to learn word vectors, called definition embeddings. We propose a new penalty that allows us to use these definition embeddings as inputs to the encoder itself, but also to blend them with pretrained distributional vectors. The definition embeddings capture semantic similarity better than distributional methods such as word2vec. Moreover, the encoder generalizes to some degree to definitions unseen during training. In the second paper, we analyze the representations learned by sequence-to-sequence variational autoencoders. We find that the encoders tend to memorize the first few words and the length of the input sentence, which drastically limits their usefulness as controllable generative models. We also analyze simpler architectural variants that are agnostic to word order, as well as pretraining-based methods. The representations they learn tend to encode global features such as topic and sentiment more sharply, and this shows in the reconstructions they produce. In the third paper, we use language emergence simulations to study compositionality. A speaker – the encoder – observes an input and produces a message about it. A listener – the decoder – tries to reconstruct what the speaker talked about from its message. We hypothesize that producing sentences involving several entities, such as “John loves Mary”, fundamentally requires perceiving each entity, John and Mary, as a distinct whole. We endow some agents with this ability via an attention mechanism and deprive others of it. We propose various metrics that measure how natural the agents’ languages are in terms of argument structure, and whether they are more analytic or synthetic. Agents that perceive entities as distinct wholes exchange more natural messages than the other agents.
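    The encoder–decoder setup described above can be made concrete with a minimal sketch, assuming PyTorch; the GRU architecture, dimensions, and token handling are illustrative choices rather than the thesis's actual models, and the consistency penalty on definition embeddings is omitted.

        # Minimal definition-autoencoder sketch (illustrative assumptions, see above).
        import torch
        import torch.nn as nn

        class DefinitionAutoencoder(nn.Module):
            def __init__(self, vocab_size=10_000, emb_dim=128, hid_dim=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
                self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
                self.out = nn.Linear(hid_dim, vocab_size)

            def forward(self, tokens):
                # Encoder: compress the whole definition into one vector.
                _, h = self.encoder(self.embed(tokens))
                # Decoder with teacher forcing: sees the previous token plus
                # the definition vector, and predicts the next token.
                bos = torch.zeros_like(tokens[:, :1])  # token id 0 as <bos>
                dec_in = self.embed(torch.cat([bos, tokens[:, :-1]], dim=1))
                dec_out, _ = self.decoder(dec_in, h)
                return self.out(dec_out)  # logits over the vocabulary

        model = DefinitionAutoencoder()
        tokens = torch.randint(1, 10_000, (4, 12))  # 4 toy definitions, 12 tokens each
        logits = model(tokens)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
        loss.backward()  # the reconstruction loss drives representation learning

    In this sketch the learned definition vector is simply the encoder's final hidden state; the penalty mentioned in the abstract, which feeds such vectors back into the encoder and blends them with pretrained distributional vectors, is left out for brevity.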

    DART: a Dataset of Arguments and their Relations on Twitter

    The problem of understanding the stream of messages exchanged on social media such as Facebook and Twitter is becoming a major challenge for automated systems. The tremendous amount of data exchanged on these platforms, as well as the specific form of language adopted by social media users, constitutes a new and challenging context for existing argument mining techniques. In this paper, we describe a resource of natural language arguments called DART (Dataset of Arguments and their Relations on Twitter), which covers the complete argument mining pipeline over Twitter messages: (i) we identify which tweets can be considered arguments and which cannot, and (ii) we identify the relation, i.e., support or attack, linking such tweets to each other.
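    As a rough illustration of that two-stage pipeline, here is a hedged sketch using scikit-learn; the TF-IDF features, logistic regression classifiers, pair encoding, and toy examples are assumptions for illustration, not the baselines shipped with DART.

        # Two-stage argument mining sketch (illustrative, not DART's own baselines).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Stage (i): decide whether a tweet is an argument at all.
        tweets = ["Vaccines save lives, the data is clear.", "good morning everyone!"]
        is_argument = [1, 0]
        arg_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        arg_clf.fit(tweets, is_argument)

        # Stage (ii): classify the relation between two argumentative tweets.
        pairs = ["Vaccines save lives. [SEP] Side effects are extremely rare.",
                 "Vaccines save lives. [SEP] Vaccines are dangerous."]
        relation = ["support", "attack"]
        rel_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        rel_clf.fit(pairs, relation)

        print(arg_clf.predict(["I think taxes should fund healthcare."]))
        print(rel_clf.predict(["Vaccines save lives. [SEP] They do more harm than good."]))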

    Learning GFlowNets from partial episodes for improved convergence and stability

    Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD($\lambda$) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB($\lambda$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB($\lambda$) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than what was possible before. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.
    Comment: ICML 202
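    For intuition, the balance condition over partial episodes and a $\lambda$-weighted objective can be sketched as follows, using standard GFlowNet notation (state flow $F$, forward policy $P_F$, backward policy $P_B$, trajectory $\tau = (s_0, \dots, s_n)$); this is a reconstruction from the abstract and the TD($\lambda$) analogy, not necessarily the paper's exact formulation.

        % Balance over a subtrajectory s_i, ..., s_j (assumed notation):
        F(s_i) \prod_{t=i}^{j-1} P_F(s_{t+1} \mid s_t)
          = F(s_j) \prod_{t=i}^{j-1} P_B(s_t \mid s_{t+1})

        % A lambda-weighted squared log-ratio loss over all subtrajectories:
        \mathcal{L}_{\mathrm{SubTB}(\lambda)}(\tau)
          = \frac{\sum_{0 \le i < j \le n} \lambda^{j-i}
                  \left( \log \frac{F(s_i) \prod_{t=i}^{j-1} P_F(s_{t+1} \mid s_t)}
                                   {F(s_j) \prod_{t=i}^{j-1} P_B(s_t \mid s_{t+1})} \right)^{2}}
                 {\sum_{0 \le i < j \le n} \lambda^{j-i}}

    Under this reading, small $\lambda$ concentrates the loss on short, transition-local subtrajectories (lower variance, higher bias), while large $\lambda$ approaches the full-trajectory objective (lower bias, higher variance), which is the tradeoff the abstract describes.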

    Anti-tumour necrosis factor discontinuation in inflammatory bowel disease patients in remission: study protocol of a prospective, multicentre, randomized clinical trial

    Background: Patients with inflammatory bowel disease who achieve remission with anti-tumour necrosis factor (anti-TNF) drugs may have treatment withdrawn due to safety concerns and cost considerations, but there is a lack of prospective, controlled data investigating this strategy. The primary study aim is to compare the rates of clinical remission at 1 year in patients who discontinue anti-TNF treatment versus those who continue treatment. Methods: This is an ongoing, prospective, double-blind, multicentre, randomized, placebo-controlled study in patients with Crohn's disease or ulcerative colitis who have achieved clinical remission for ≄6 months with an anti-TNF treatment and an immunosuppressant. Patients are being randomized 1:1 to discontinue anti-TNF therapy or continue therapy. Randomization stratifies patients by the type of inflammatory bowel disease and drug (infliximab versus adalimumab) at study inclusion. The primary endpoint of the study is sustained clinical remission at 1 year. Other endpoints include endoscopic and radiological activity, patient-reported outcomes (quality of life, work productivity), safety and predictive factors for relapse. The required sample size is 194 patients. In addition to the main analysis (discontinuation versus continuation), subanalyses will include stratification by type of inflammatory bowel disease, phenotype and previous treatment. Biological samples will be obtained to identify factors predictive of relapse after treatment withdrawal. Results: Enrolment began in 2016, and the study is expected to end in 2020. Conclusions: This study will contribute prospective, controlled data on outcomes and predictors of relapse in patients with inflammatory bowel disease after withdrawal of anti-TNF agents following achievement of clinical remission. Clinical trial reference number: EudraCT 2015-001410-1